STATS 32 Session 8: A Crash Course in Statistics and Modeling

Kenneth Tay

Oct 25, 2018

Announcements

Project due on 2 Nov (Fri) 23:59:59

Remaining office hours:

10am-12pm, Sequoia Hall Rm 207

Recap of session 7

Agenda for today

A very high-level picture: for technical details, take STATS 60/STATS 101

Today’s dataset: Top 100 songs on Spotify

(Source: Spotify)

Tempo by mode: Is there a difference?

Hard to tell from the histograms:

Look at mean tempo for each mode
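
A minimal sketch of this comparison in base R. The numbers below are made up for illustration (they are not the real Spotify values), and coding mode as "Major"/"Minor" is an assumption about how the column is stored.

```r
# Made-up toy data standing in for the Spotify dataset (not real values)
songs <- data.frame(
  tempo = c(95, 150, 105, 124, 136, 100, 118, 90),
  mode  = c("Major", "Major", "Major", "Major",
            "Minor", "Minor", "Minor", "Minor")
)

# Mean tempo within each mode
mean_tempo <- sapply(split(songs$tempo, songs$mode), mean)
print(mean_tempo)

# Difference in mean tempo (Major minus Minor)
diff_means <- mean_tempo[["Major"]] - mean_tempo[["Minor"]]
print(diff_means)
```

The difference in means is a single number; whether it is large enough to matter is what the hypothesis test addresses.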

Is this difference significant? What do we mean by significance anyway?

Structure of a hypothesis test

  1. Start with a null hypothesis: An assumption on how the data is generated
  2. Based on this assumption, how likely were we to collect data at least as extreme as what we have?
    • p-value: probability of collecting data at least as extreme as ours (if the null hypothesis is true)
  3. Is the p-value considered low or not?
    • Threshold should depend on the context
    • Typical thresholds: 0.1, 0.05, 0.01
  4. If the p-value is below the threshold, two possible conclusions:
    • A rare event just happened, or
    • Our assumption in Step 1 was false
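
The four steps above can be sketched with a permutation test, which is one possible way to compute a p-value and not necessarily the method used in this session. The data are made up, and the number of shuffles and the 0.05 threshold are arbitrary choices for illustration.

```r
# Made-up toy data (not the real Spotify values)
tempo <- c(95, 150, 105, 124, 136, 100, 118, 90)
mode  <- c("Major", "Major", "Major", "Major",
           "Minor", "Minor", "Minor", "Minor")

# Test statistic: difference in mean tempo (Major minus Minor)
diff_in_means <- function(x, g) mean(x[g == "Major"]) - mean(x[g == "Minor"])
observed <- diff_in_means(tempo, mode)

# Step 1: null hypothesis = mode is unrelated to tempo, so any
# reshuffling of the mode labels was just as likely as the real labels
set.seed(32)
n_shuffles <- 2000
shuffled <- replicate(n_shuffles, diff_in_means(tempo, sample(mode)))

# Step 2: p-value = share of shuffles at least as extreme as the observed value
p_value <- mean(abs(shuffled) >= abs(observed))

# Steps 3 and 4: compare with a threshold chosen in advance
threshold <- 0.05
reject_null <- p_value < threshold
print(p_value)
print(reject_null)
```

Taking absolute values makes this a two-sided test: a large difference in either direction counts as extreme.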

Tempo by mode: Is there a difference?

Two options:

What is a model?

Two steps to modeling

Step 1: Identify a family of models which express a generic pattern between your variables of interest.

Possible model family: Linear model, i.e. \(child = a_1 + a_2 \times parent\).

Many other possible models: linear without intercept, quadratic, exponential, …
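
As a rough illustration, each of these families can be written as an R formula. The data below are made up, and lm() is used only to show how many coefficients each family leaves to be chosen; the exponential family is fitted on the log scale, which is one common way to handle it.

```r
# Made-up toy data; parent and child are the variable names from the model above
parent <- c(64, 66, 68, 70, 72)
child  <- c(65.5, 66.0, 68.5, 69.0, 70.5)

# Linear:                   child = a1 + a2 * parent
fit_linear <- lm(child ~ parent)

# Linear without intercept: child = a2 * parent
fit_no_intercept <- lm(child ~ 0 + parent)

# Quadratic:                child = a1 + a2 * parent + a3 * parent^2
fit_quadratic <- lm(child ~ parent + I(parent^2))

# Exponential:              child = exp(a1 + a2 * parent), fitted as
#                           log(child) = a1 + a2 * parent
fit_exponential <- lm(log(child) ~ parent)

# Each family has its own set of coefficients to choose
print(coef(fit_linear))
print(coef(fit_no_intercept))
print(coef(fit_quadratic))
print(coef(fit_exponential))
```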

Different models within the linear model family

Each line corresponds to a choice of \(a_1\) and \(a_2\).

Two steps to modeling

Step 2: Find the model in this family that most closely matches your data.

That is, find specific values of \(a_1\) and \(a_2\) which make the model match the data most closely.

What do we mean by “closely matching the data”?

We choose \(a_1\) and \(a_2\) such that some objective function (loss function) is minimized.

Most common objective: Minimize the sum of squared residuals, i.e. the squared vertical distances between each data point and the line (the black lines in the figure).

(Source: uc-r.github.io)
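
A minimal sketch of the sum-of-squares objective, using made-up data. The helper name sum_of_squares is hypothetical; the closed-form expressions for the minimizing coefficients are the standard ones for a straight-line fit.

```r
# Made-up toy data; parent and child are the variable names from the model above
parent <- c(64, 66, 68, 70, 72)
child  <- c(65.5, 66.0, 68.5, 69.0, 70.5)

# Loss: sum of squared vertical distances between the data
# and the line a[1] + a[2] * parent
sum_of_squares <- function(a) sum((child - (a[1] + a[2] * parent))^2)

# Coefficients that minimize this loss for a straight line
a2_hat <- cov(parent, child) / var(parent)
a1_hat <- mean(child) - a2_hat * mean(parent)

# Any other line in the family has a larger loss, e.g. one shifted up by 1
loss_best  <- sum_of_squares(c(a1_hat, a2_hat))
loss_other <- sum_of_squares(c(a1_hat + 1, a2_hat))
print(c(a1 = a1_hat, a2 = a2_hat))
print(c(best = loss_best, other = loss_other))
```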

Linear models in R
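
A minimal sketch of fitting a linear model with lm(). The data are made up for illustration (not the real Spotify values); valence and loudness are the column names used in the formulas later in this session.

```r
# Made-up toy data (not the real Spotify values)
songs <- data.frame(
  valence  = c(0.8, 0.6, 0.7, 0.5, 0.4, 0.3, 0.6, 0.3),
  loudness = c(-4, -6, -5, -8, -7, -9, -5, -10)
)

# Fit valence = a1 + a2 * loudness by least squares
fit <- lm(valence ~ loudness, data = songs)

# Fitted coefficients: "(Intercept)" is a1, "loudness" is a2
print(coef(fit))

# Fuller report: standard errors, p-values, R-squared
print(summary(fit))

# Predicted valence for a new song with loudness -6
pred <- predict(fit, newdata = data.frame(loudness = -6))
print(pred)
```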

Models with categorical variables

Consider modeling valence ~ mode.
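
A sketch of what lm() does with a categorical variable, on made-up data. It assumes mode is stored as a factor with levels "Major" and "Minor"; with R's default coding the first level ("Major") becomes the baseline.

```r
# Made-up toy data (not the real Spotify values)
songs <- data.frame(
  valence = c(0.8, 0.6, 0.7, 0.5, 0.4, 0.3, 0.6, 0.3),
  mode    = factor(c("Major", "Major", "Major", "Major",
                     "Minor", "Minor", "Minor", "Minor"))
)

# lm() turns the factor into a 0/1 indicator column (1 if mode is "Minor")
print(model.matrix(~ mode, data = songs))

fit <- lm(valence ~ mode, data = songs)
coefs <- coef(fit)
print(coefs)

# The intercept is the mean valence of the baseline group ("Major");
# the modeMinor coefficient is the Minor mean minus the Major mean
mean_major <- mean(songs$valence[songs$mode == "Major"])
mean_minor <- mean(songs$valence[songs$mode == "Minor"])
print(c(major = mean_major, minor_minus_major = mean_minor - mean_major))
```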

Additive models

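Formula valence ~ loudness + mode translates to \(valence = a_1 + a_2 \times loudness + a_3 \times mode\), where \(mode\) is coded as a 0/1 indicator.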
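
A sketch of the additive model on made-up data, assuming mode is a factor with levels "Major" and "Minor". It shows that the fit amounts to two parallel lines: one shared slope for loudness and a separate intercept for each mode.

```r
# Made-up toy data (not the real Spotify values)
songs <- data.frame(
  valence  = c(0.8, 0.6, 0.7, 0.5, 0.4, 0.3, 0.6, 0.3),
  loudness = c(-4, -6, -5, -8, -7, -9, -5, -10),
  mode     = factor(c("Major", "Major", "Major", "Major",
                      "Minor", "Minor", "Minor", "Minor"))
)

fit <- lm(valence ~ loudness + mode, data = songs)
b <- coef(fit)
print(b)

# One shared slope, one intercept per mode: two parallel lines
lines_by_mode <- data.frame(
  mode      = c("Major", "Minor"),
  intercept = c(b[["(Intercept)"]], b[["(Intercept)"]] + b[["modeMinor"]]),
  slope     = c(b[["loudness"]], b[["loudness"]])
)
print(lines_by_mode)
```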

Models with interaction

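Formula valence ~ loudness * mode translates to \(valence = a_1 + a_2 \times loudness + a_3 \times mode + a_4 \times loudness \times mode\), where \(mode\) is coded as a 0/1 indicator.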
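
A sketch of the interaction model on made-up data, assuming mode is a factor with levels "Major" and "Minor". Unlike the additive model, each mode gets its own slope as well as its own intercept, so the fitted line for one mode matches a separate regression on that mode's songs alone.

```r
# Made-up toy data (not the real Spotify values)
songs <- data.frame(
  valence  = c(0.8, 0.6, 0.7, 0.5, 0.4, 0.3, 0.6, 0.3),
  loudness = c(-4, -6, -5, -8, -7, -9, -5, -10),
  mode     = factor(c("Major", "Major", "Major", "Major",
                      "Minor", "Minor", "Minor", "Minor"))
)

fit <- lm(valence ~ loudness * mode, data = songs)
b <- coef(fit)
print(b)

# loudness * mode is shorthand for loudness + mode + loudness:mode
fit_long <- lm(valence ~ loudness + mode + loudness:mode, data = songs)

# Slope of valence against loudness within each mode
slope_major <- b[["loudness"]]
slope_minor <- b[["loudness"]] + b[["loudness:modeMinor"]]
print(c(major = slope_major, minor = slope_minor))

# Same slope as a regression fitted to the Minor songs only
fit_minor_only <- lm(valence ~ loudness, data = songs, subset = mode == "Minor")
print(coef(fit_minor_only))
```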

Summary of the course

Where do we go from here?

Other Stanford courses

Thank you! :)